1  Summary

The research conducted under this grant concerned the application of the theory of partially
observable Markov decision processes (POMDPs) to the design of guidance algorithms for controlling
the motion of unmanned aerial vehicles (UAVs) with on-board sensors to improve tracking of multiple
ground targets. While POMDP problems are intractable to solve exactly, principled approximation
methods can be devised based on the theory that characterizes optimal solutions. A new
approximation method called nominal belief-state optimization (NBO) was proposed. When combined
with other application-specific approximations and techniques within the POMDP framework, NBO
produced a practical design that coordinated the UAVs to achieve good long-term mean-squared-error
tracking performance in the presence of occlusions and dynamic constraints. The flexibility of the
design was demonstrated by extending the objective to reduce the probability of a track swap in
ambiguous situations, with the positive side effect of improving the mean-squared-error tracking
performance as well.

The personnel contributing to this research are Scott Miller and Zachary Harris of Numerica
Corp., and Prof. Edwin Chong of Colorado State University. The following articles were produced
as an outcome of this grant: [25, 7, 27, 20].

2  Introduction 

Interest  in  unmanned  aerial  vehicles  (UAVs)  for  applications  such  as  surveillance,  search,  and 
target  tracking  has  increased  in  recent  years,  owing  to  significant  progress  in  their  development  and 
a  number  of  recognized  advantages  in  their  use  [12,  41]. 

This report describes a principled framework for designing a planning and coordination
algorithm to control a fleet of UAVs for the purpose of tracking ground targets. The algorithm runs
on a central fusion node that collects measurements generated by sensors on-board the UAVs,
constructs tracks from those measurements, plans the future motion of the UAVs to maximize tracking
performance, and sends motion commands back to the UAVs based on the plan.

The focus of this report is to illustrate a design framework based on the theory of partially
observable Markov decision processes (POMDPs), and to discuss practical issues related to the
use of the framework. With this in mind, the problem scenarios presented here are idealized, and
are meant to illustrate qualitative behavior of a guidance system design. Moreover, the particular
approximations employed in the design are examples and can certainly be improved. Nevertheless,
the intent is to present a design approach that is flexible enough to admit refinements to
models, objectives, and approximation methods without damaging the underlying structure of the
framework.

Section  3  describes  the  nature  of  the  UAV  guidance  problem  addressed  here  in  more  detail, 
and  places  it  in  the  context  of  the  sensor  resource  management  literature.  The  detailed  problem 
specification  is  presented  in  Section  4,  and  our  method  for  approximating  the  solution  is  discussed 
in  Section  5.  Several  features  of  our  approach  are  already  apparent  in  the  case  of  a  single  UAV,  as 
discussed  in  Section  6.  The  method  is  extended  to  multiple  UAVs  in  Section  7,  where  coordination 
of  multiple  sensors  is  demonstrated.  In  Section  8  we  illustrate  the  flexibility  of  the  POMDP 
framework  by  modifying  it  to  include  more  complex  tracking  objectives  such  as  preventing  track 
swaps. Finally, we conclude in Section 9 with summary remarks and future directions.

3  Problem  Description

The  class  of  problems  we  pose  in  this  report  is  a  rather  schematic  representation  of  the  UAV 
guidance  problem.  Simplifications  are  assumed  for  ease  of  presentation  and  understanding  of  the 
key  issues  involved  in  sensor  coordination.  These  simplifications  include: 

2-D  motion:  The  targets  are  assumed  to  move  in  a  plane  on  the  ground,  while  the  UAVs  are 
assumed  to  fly  at  a  constant  altitude  above  the  ground. 

position measurements: The measurements generated by the sensors are 2-D position measurements
with associated covariances describing the position uncertainty. A simplified visual
sensor (camera plus image processing) is assumed, which implies that the angular resolution
is much better than the range resolution.

perfect tracker: We assume that there are no false alarms and no missed detections, so exactly
one measurement is generated for each target visible to the sensor. Also, perfect data
association is usually assumed, so the tracker knows which measurement came from which target,
though this assumption is relaxed in Section 8 when track ambiguity is considered.

Nevertheless, the problem class has a number of important features that influence the design of a
good planning algorithm. These include:

dynamic constraints: These appear in the form of constraints on the motion of the UAVs.
Specifically, the UAVs fly at a constant speed and have bounded lateral acceleration in the plane,
which limits their turning radius. This is a reasonable model of the characteristics of small
fixed-wing aircraft. The presence of dynamic constraints implies that the planning algorithm
needs to include some form of lookahead for good long-term performance.

randomness: The measurements have random errors, and the models of target motion are random
as well. However, in most of our simulations the actual target motion is not random.

spatially  varying  measurement  error:  The  range  error  of  the  sensor  is  an  affine  function  of 
the  distance  between  the  sensor  and  the  target.  The  bearing  error  of  the  sensor  is  constant, 
but  that  translates  to  a  proportional  error  in  Cartesian  space  as  well.  This  spatially  varying 
error  is  what  makes  the  sensor  placement  problem  meaningful. 

occlusions: There are occlusions in the plane that block the visibility of targets from sensors
when they are on opposite sides of an occlusion. The occlusions are generally collections of
rectangles in our models, though in the case studies presented they appear more as walls (thin
rectangles). Targets are allowed to cross occlusions, and of course the UAVs are allowed to
fly over them; their purpose is only to make the observation of targets more challenging.

tracking objectives: The performance objectives considered here are related to maintaining the
best tracks on the targets. Normally, that means minimizing the mean squared error between
tracks and targets, but in Section 8 we also consider the avoidance of track swaps as a
performance objective. This differs from most of the guidance literature, where the objective
is usually posed as interpolation of way-points.

In Section 4 we demonstrate that the UAV guidance problem described here is a POMDP. One
implication is that the exact problem is in general formally undecidable [24], so one must resort
to approximations. However, another implication is that the optimal solution to this problem is
characterized by a form of Bellman’s principle, and this principle can be used as a basis for a
structured approximation of the optimal solution. In fact, the main goal of our research is to
demonstrate that the design of the UAV guidance system can be made practical by a limited and
precisely understood use of heuristics to approximate the ideal solution. That is, the heuristics
are used in such a way that their influence may be relaxed and the solution improved as more
computational resources become available.

The  UAV  guidance  problem  considered  here  falls  within  the  class  of  problems  known  as  sensor 
resource  management  [29].  In  its  full  generality,  sensor  resource  management  encompasses  a  large 
body  of  problems  arising  from  the  increasing  variety  and  complexity  of  sensor  systems,  including 
dynamic tasking of sensors, dynamic sensor placement, control of sensing modalities (such as
waveforms), communication resource allocation, and task scheduling within a sensor [15]. A number of
approaches  have  been  proposed  to  address  the  design  of  algorithms  for  sensor  resource  management, 
which  can  be  broadly  divided  into  two  categories:  myopic  and  nonmyopic. 

Myopic  approaches  do  not  explicitly  account  for  the  future  effects  of  sensor  resource  management 
decisions  (i.e.,  there  is  no  explicit  planning  or  “lookahead”).  One  approach  within  this  category 
is  based  on  fuzzy  logic  and  expert  systems  [28],  which  exploits  operator  knowledge  to  design  a 
resource manager. Another approach uses information-theoretic measures as a basis for sensor
resource management [36, 11, 20]. In this approach, sensor controls are determined based on
maximizing a measure of “information.”

Nonmyopic  approaches  to  sensor  resource  management  have  gained  increasing  interest  because 
of  the  need  to  account  for  the  kinds  of  requirements  described  in  this  report,  which  imply  that 
foresight and planning are crucial for good long-term performance. In the context of UAV
coordination and control, such approaches include the use of guidance rules [17, 21, 35, 41], oscillator
models [16], and information-driven coordination [12, 34]. A more general approach to dealing
with nonmyopic resource management involves stochastic dynamic programming formulations of
the problem (or, more specifically, POMDPs). As pointed out in Section 5, exact optimal solutions
are practically infeasible to compute. Therefore, recent effort has focused on obtaining approximate
solutions, and a number of methods have been developed (e.g., see [9, 13, 14, 18, 22, 23]). Our
research contributes to the further development of this thrust by introducing a new approximation
method, called nominal belief-state optimization, and applying it to the UAV guidance problem.

Approximation  methods  for  POMDPs  have  been  prominent  in  the  recent  literature  on  artificial 
intelligence (AI), under the rubric of probabilistic robotics [40]. In contrast to many of the POMDP
methods in the AI literature, a distinguishing feature of our current approach is that the state and
action spaces in our UAV guidance problem formulation are continuous. We should note that some
recent AI efforts have also treated the continuous case (e.g., see [39, 30, 32]), though in different
settings.

4  POMDP  Specification  and  Solution 

In this section we describe the mathematical formulation of our guidance problem as a partially
observable Markov decision process (POMDP). We first provide a general definition of POMDPs.
We provide this background exposition for the sake of completeness; readers who already have
this background can skip this subsection. Then we proceed to the specification of the POMDP
for the guidance problem. Finally, we discuss the nature of POMDP solutions, leading up to a
discussion of approximation methods in the next section. For a full treatment of POMDPs and
related background, see [2]. For a discussion of POMDPs in sensor management, see [15].

4.1  Definition  of  POMDP

A  POMDP  is  a  controlled  dynamical  process,  useful  in  modeling  a  wide  range  of  resource  control 
problems.  To  specify  a  POMDP  model,  we  need  to  specify  the  following  components: 

•  a  set  of  states  (the  state  space)  and  a  distribution  specifying  the  random  initial  state; 

•  a set of possible actions;

•  a state-transition law specifying the next-state distribution given an action taken at a current
state;

•  a set of possible observations;

•  an  observation  law  specifying  the  distribution  of  observations  depending  on  the  current  state 
and  possibly  the  action; 

•  a  cost  function  specifying  the  cost  (real  number)  of  being  in  a  given  state  and  taking  a  given 
action. 

In  the  next  subsection,  we  specify  these  components  for  our  guidance  problem. 

As a POMDP evolves over time as a dynamical process, we do not have direct access to the
states. Instead, all we have are the observations generated over time, providing us with clues of
the actual underlying states (hence the term partially observable). These observations might, in
some cases, allow us to infer exactly what states actually occurred. However, in general, there will
be some uncertainty in our knowledge of the states. This uncertainty is represented by the belief
state, which is the a posteriori distribution of the underlying state given the history of observations.
The belief states summarize the “feedback” information that is needed for controlling the system.
Conveniently, the belief state can easily be tracked over time using Bayesian methods. Indeed, as
pointed out below, in our guidance problem the belief state is a quantity that is already available
(approximately) as track states.
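
As a minimal illustration of tracking a belief state with Bayes' rule, the following Python sketch performs one prediction and correction step over a finite state space. The finite state space, the function name bayes_update, and the numbers in the example are assumptions made for illustration only; the guidance problem in this report has a continuous state space and uses Kalman filtering instead.

```python
def bayes_update(belief, transition, likelihood):
    """One Bayes-filter step on a finite state space (illustrative only).

    belief[i]        : prior probability of state i
    transition[i][j] : probability of moving from state i to state j
                       under the action that was applied
    likelihood[j]    : probability of the received observation in state j
    """
    n = len(belief)
    # Prediction: push the prior through the state-transition law.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Correction: weight by the observation law and renormalize.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]


# Hypothetical two-state example.
prior = [0.5, 0.5]
sticky = [[0.9, 0.1], [0.1, 0.9]]
obs_likelihood = [0.8, 0.2]
posterior = bayes_update(prior, sticky, obs_likelihood)
```

The posterior is again a probability distribution over states, so the same function can be applied recursively as new observations arrive.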

Once  we  have  specified  the  above  components  of  a  POMDP,  the  guidance  problem  is  posed  as 
an  optimization  problem  where  the  expected  cumulative  cost  over  a  time  horizon  is  the  objective 
function to be minimized. The decision variables in this optimization problem are the actions to be
applied  over  the  planning  horizon.  However,  because  of  the  stochastic  nature  of  the  problem,  the 
optimal  actions  are  not  fixed  but  are  allowed  to  depend  on  the  particular  realization  of  the  random 
variables  observed  in  the  past.  Hence,  the  optimal  solution  is  a  feedback-control  rule,  usually  called 
a  policy.  More  formally,  a  policy  is  a  mapping  that,  at  each  time,  takes  the  belief  state  and  gives 
us  a  particular  control  action,  chosen  from  the  set  of  possible  actions.  What  we  seek  is  an  optimal 
policy.  We  will  characterize  optimal  policies  in  a  later  subsection,  after  we  discuss  the  POMDP 
formulation  of  the  guidance  problem. 

4.2  POMDP  Formulation  of  Guidance  Problem 

To  formulate  our  guidance  problem  in  the  POMDP  framework  we  must  specify  each  of  the  above 
components  as  they  relate  to  the  guidance  system.  This  subsection  is  devoted  to  this  specification. 

States. In the guidance problem, three subsystems must be accounted for in specifying the
state of the system: the sensor(s), the target(s), and the tracker. More precisely, the state at
time $k$ is given by $x_k = (s_k, \zeta_k, \xi_k, P_k)$, where $s_k$ represents the sensor state, $\zeta_k$ represents the
target state, and $(\xi_k, P_k)$ represents the track state. The sensor state $s_k$ specifies the locations and
velocities of the sensors (UAVs) at time $k$. The target state $\zeta_k$ specifies the locations, velocities,
and accelerations of the targets at time $k$. Finally, the track state $(\xi_k, P_k)$ represents the state of
the tracking algorithm: $\xi_k$ is the posterior mean vector and $P_k$ is the posterior covariance matrix,
standard in Kalman filtering algorithms. The representation of the state as a vector of state
variables is an instance of a factored model [5].

Action. In our guidance problem, we assume a standard model where each UAV flies at constant
speed and its motion is controlled through turning controls that specify lateral instantaneous
accelerations. The lateral accelerations can take values in an interval $[-a_{\max}, a_{\max}]$, where $a_{\max}$
represents a maximum limit on the possible lateral acceleration. So the action at time $k$ is given by
$a_k \in [-1, 1]^{N_{\text{sens}}}$, where $N_{\text{sens}}$ is the number of UAVs, and the components of the vector $a_k$ specify
the normalized lateral acceleration of each UAV.
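
The following Python sketch shows one way a constant-speed UAV with a normalized lateral-acceleration command could be propagated over a time step. It is not the report's implementation: the function name uav_step, the heading/speed parametrization of the sensor state, and the exact circular-arc integration are assumptions chosen for illustration.

```python
import math


def uav_step(x, y, heading, speed, a_norm, a_max, dt):
    """Advance a constant-speed UAV by dt under a normalized turn command.

    a_norm is clipped to [-1, 1]. The lateral acceleration a_norm * a_max
    at constant speed corresponds to a turn rate of a_norm * a_max / speed.
    """
    a_norm = max(-1.0, min(1.0, a_norm))
    omega = a_norm * a_max / speed          # turn rate in rad/s
    if abs(omega) < 1e-12:
        # Straight-line flight.
        x += speed * dt * math.cos(heading)
        y += speed * dt * math.sin(heading)
    else:
        # Exact arc of a circle with radius speed / |omega|.
        new_heading = heading + omega * dt
        x += speed / omega * (math.sin(new_heading) - math.sin(heading))
        y -= speed / omega * (math.cos(new_heading) - math.cos(heading))
        heading = new_heading
    return x, y, heading


# Hypothetical numbers: 10 m/s, 5 m/s^2 limit, so the minimum turn radius is 20 m.
quarter_turn = uav_step(0.0, 0.0, 0.0, 10.0, 1.0, 5.0, math.pi)
```

With these numbers a full-deflection command held for pi seconds turns the vehicle through a quarter circle of radius speed**2 / a_max = 20 m, which makes the link between the acceleration bound and the turning radius explicit.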

State-transition law. The state-transition law specifies how each component of the state
changes from one time step to the next. In general, the transition law takes the form

$$x_{k+1} \sim p_k(\cdot \mid x_k)$$

for some time-varying distribution $p_k$. However, the model for the UAV guidance problem constrains
the form of the state-transition law. The sensor state evolves according to

$$s_{k+1} = \psi(s_k, a_k),$$

where $\psi$ is the map that defines how the state changes from one time step to the next depending
on the acceleration control as described above. The target state evolves according to

$$\zeta_{k+1} = f(\zeta_k) + v_k,$$

where $v_k$ represents an i.i.d. random sequence and $f$ represents the target motion model. Most
of our simulation results use a nearly constant velocity (NCV) target motion model, except for
Section 7.2, which uses a nearly constant acceleration (NCA) model. In all cases $f$ is linear and
$v_k$ is normally distributed. We write $v_k \sim \mathcal{N}(0, Q_k)$ to indicate that the noise is normal with zero
mean and covariance $Q_k$.
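
To make the linear-Gaussian target model concrete, here is a short Python sketch of one step of a 2-D nearly constant velocity model. The report does not give the form of the process noise covariance, so the choice below (a random acceleration held constant over the step, often called the discrete white-noise-acceleration model) is an assumption, as are the function name ncv_step and the parameter accel_std.

```python
import random


def ncv_step(state, dt, accel_std, rng=random):
    """One step of a 2-D nearly-constant-velocity model (illustrative).

    state = (px, py, vx, vy). The process noise is generated by a random
    acceleration held constant over the step, so it enters position
    through dt**2 / 2 and velocity through dt.
    """
    px, py, vx, vy = state
    ax = rng.gauss(0.0, accel_std)
    ay = rng.gauss(0.0, accel_std)
    return (px + vx * dt + 0.5 * ax * dt * dt,
            py + vy * dt + 0.5 * ay * dt * dt,
            vx + ax * dt,
            vy + ay * dt)


# With zero noise the model reduces to straight-line motion.
noise_free = ncv_step((0.0, 0.0, 1.0, 2.0), 2.0, 0.0)
```

Setting accel_std to zero recovers the deterministic part of the model, which is the map $f$ in the notation above.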

Finally, the track state $(\xi_k, P_k)$ evolves according to a tracking algorithm, which is defined by
a data association method and the Kalman filter update equations. Since our focus is on UAV
guidance and not on practical tracking issues, in most cases a “truth tracker” is used, which always
associates a measurement with the track corresponding to the target being detected. Only in
Section 8 is a non-ideal data association considered, for the purpose of evaluating performance
with ambiguous associations.

Observations and observation law. In general, the observation law takes the form

$$z_k \sim q_k(\cdot \mid x_k)$$

for some time-varying distribution $q_k$. In our guidance problem, since the state has four separate
components, it is convenient to express the observation with four corresponding components (a
factored representation). The sensor state and track state are assumed to be fully observable. So,
for these components of the state, the observations are equal to the underlying state components:

$$z_k^s = s_k, \quad z_k^\xi = \xi_k, \quad z_k^P = P_k.$$

The target state, however, is not directly observable; instead, what we have are random
measurements of the target state that are functions of the locations of the targets and the sensors.

Let $\zeta_k^{\text{pos}}$ and $s_k^{\text{pos}}$ represent the position vectors of the target and sensor, respectively, and let
$h(\zeta_k, s_k)$ be a boolean-valued function that is true if the line of sight from $s_k^{\text{pos}}$ to $\zeta_k^{\text{pos}}$ is unobscured
by any occlusions. Furthermore, we define a 2D position covariance matrix $R_k(\zeta_k, s_k)$ that reflects
a 10% uncertainty in the range from sensor to target, and $0.01\pi$ radian angular uncertainty, where
the range is taken to be at least 10 meters. Then the measurement of the target state at time $k$ is
given by

$$z_k^\zeta = \begin{cases} \zeta_k^{\text{pos}} + w_k & \text{if } h(\zeta_k, s_k) = \text{true}, \\ \emptyset \ \text{(no measurement)} & \text{if } h(\zeta_k, s_k) = \text{false}, \end{cases}$$

where $w_k$ represents an i.i.d. sequence of noise values distributed according to the normal
distribution $\mathcal{N}(0, R_k(\zeta_k, s_k))$.
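
The two ingredients of this observation model can be sketched in Python as follows. This is an illustration under stated assumptions, not the report's code: the 10% and $0.01\pi$ figures are interpreted as standard deviations, occlusions are simplified to zero-thickness wall segments, only proper crossings count as blocking, and the function names are hypothetical.

```python
import math


def measurement_covariance(target_pos, sensor_pos, range_frac=0.10,
                           angle_std=0.01 * math.pi, min_range=10.0):
    """2x2 Cartesian covariance of a position measurement (illustrative)."""
    dx = target_pos[0] - sensor_pos[0]
    dy = target_pos[1] - sensor_pos[1]
    rng = max(math.hypot(dx, dy), min_range)   # range floored at min_range
    theta = math.atan2(dy, dx)                 # line-of-sight direction
    var_r = (range_frac * rng) ** 2            # along the line of sight
    var_c = (angle_std * rng) ** 2             # across the line of sight
    c, s = math.cos(theta), math.sin(theta)
    # Rotate diag(var_r, var_c) from line-of-sight axes into x/y axes.
    off = c * s * (var_r - var_c)
    return [[c * c * var_r + s * s * var_c, off],
            [off, s * s * var_r + c * c * var_c]]


def _orient(p, q, r):
    """Signed area test: positive if r lies to the left of the ray p -> q."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def line_of_sight_clear(sensor_pos, target_pos, walls):
    """True if the sensor-target segment properly crosses no wall segment."""
    for a, b in walls:
        d1 = _orient(sensor_pos, target_pos, a)
        d2 = _orient(sensor_pos, target_pos, b)
        d3 = _orient(a, b, sensor_pos)
        d4 = _orient(a, b, target_pos)
        if d1 * d2 < 0 and d3 * d4 < 0:
            return False
    return True


# Hypothetical geometry: one wall between x = 0 and x = 10.
wall = [((5.0, -1.0), (5.0, 1.0))]
R = measurement_covariance((100.0, 0.0), (0.0, 0.0))
blocked = not line_of_sight_clear((0.0, 0.0), (10.0, 0.0), wall)
```

Because both variances grow with range, the covariance shrinks as the sensor approaches the target, which is the spatial dependence that makes sensor placement meaningful.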

Cost function. The cost function we most commonly use in our guidance problem is the mean
squared tracking error, defined by

$$C(x_k, a_k) = \mathrm{E}_{v_k, w_{k+1}}\left[ \|\zeta_{k+1} - \xi_{k+1}\|^2 \mid x_k, a_k \right]. \tag{4.1}$$

In Section 8.1 we describe a different cost function which we use for detecting track ambiguity.

Belief  state.  Although  not  a  part  of  the  POMDP  specification,  it  is  convenient  at  this  point 
to  define  our  notation  for  the  belief  state  for  the  guidance  problem.  The  belief  state  at  time  k  is 
given  by 

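$$b_k = (b_k^s, b_k^\zeta, b_k^\xi, b_k^P),$$

where

$$\begin{aligned}
b_k^s(s) &= \delta(s - s_k), \\
b_k^\zeta &\ \text{is updated with } z_k^\zeta \text{ using Bayes' theorem}, \\
b_k^\xi(\xi) &= \delta(\xi - \xi_k), \\
b_k^P(P) &= \delta(P - P_k).
\end{aligned}$$

Note that those components of the state that are directly observable have delta functions
representing their corresponding belief-state components.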

We  have  deliberately  distinguished  between  the  belief  state  and  the  track  state  (the  internal 
state  of  the  tracker).  The  reason  for  this  distinction  is  so  that  the  model  is  general  enough  to 
accommodate  a  variety  of  tracking  algorithms,  even  those  that  are  acknowledged  to  be  severe 
approximations  of  the  actual  belief  state.  For  the  purpose  of  control,  it  is  natural  to  use  the 
internal  state  of  the  tracker  as  one  of  the  inputs  to  the  controller  (and  it  is  intuitive  that  the 
control performance would benefit from the use of this information). Therefore, it is appropriate to
incorporate the track state into the POMDP state space, even if this is not prima facie obvious.


4.3  Optimal  Policy 

Given  the  POMDP  formulation  of  our  problem,  our  goal  is  to  select  actions  over  time  to  minimize 
the expected cumulative cost (we take expectation here because the cumulative cost is a random
variable, being a function of the random evolution of $x_k$). To be specific, suppose we are interested
in the expected cumulative cost over a time horizon of length $H$: $k = 0, 1, \ldots, H - 1$. The problem
is to minimize the cumulative cost over horizon $H$, given by

$$J_H = \mathrm{E}\left[ \sum_{k=0}^{H-1} C(x_k, a_k) \right]. \tag{4.2}$$

The goal is to pick the actions so that the objective function is minimized. In general, the action
chosen at each time should be allowed to depend on the entire history up to that time (i.e., the
action at time $k$ is a random variable that is a function of all observable quantities up to time $k$).
However,  it  turns  out  that  if  an  optimal  choice  of  such  a  sequence  of  actions  exists,  then  there  is  an 
optimal  choice  of  actions  that  depends  only  on  “belief-state  feedback.”  In  other  words,  it  suffices 
for  the  action  at  time  k  to  depend  only  on  the  belief  state  at  time  k,  as  alluded  to  before. 

Let $b_k$ be the belief state at time $k$, which is a distribution over states,

$$b_k(x) = P_{x_k}(x \mid z_0, \ldots, z_k; a_0, \ldots, a_{k-1}),$$

updated incrementally using Bayes’ rule. The objective can be written in terms of belief states:

$$J_H = \mathrm{E}\left[ \sum_{k=0}^{H-1} c(b_k, a_k) \,\middle|\, b_0 \right], \qquad c(b, a) = \int C(x, a)\, b(x)\, dx, \tag{4.3}$$

where $\mathrm{E}[\cdot \mid b_0]$ represents conditional expectation given $b_0$. Let $B$ represent the set of possible
belief states, and $A$ the set of possible actions. So what we seek is, at each time $k$, a mapping
$\pi_k^* : B \to A$ such that if we perform action $a_k = \pi_k^*(b_k)$, then the resulting objective function is
minimized. This is the desired optimal policy.

The key result in POMDP theory is Bellman’s principle. Let $J_H^*(b_0)$ be the optimal objective
function value (over horizon $H$) with $b_0$ as the initial belief state. Then, Bellman’s principle states
that

$$\pi_0^*(b_0) = \arg\min_a \left\{ c(b_0, a) + \mathrm{E}[J_{H-1}^*(b_1) \mid b_0, a] \right\}$$

is an optimal policy, where $b_1$ is the random next belief state (with distribution depending on $a$),
$\mathrm{E}[\cdot \mid b_0, a]$ represents conditional expectation (given $b_0$ and action $a$) with respect to the random
next state $b_1$, and $J_{H-1}^*(b_1)$ is the optimal cumulative cost over the time horizon $1, \ldots, H$ starting
with belief state $b_1$.

Define the Q-value of taking action $a$ at state $b_0$ as

$$Q_H(b_0, a) = c(b_0, a) + \mathrm{E}[J_{H-1}^*(b_1) \mid b_0, a].$$

Then, Bellman’s principle can be rewritten as

$$\pi_0^*(b_0) = \arg\min_a Q_H(b_0, a),$$

i.e., the optimal action at belief state $b_0$ is the one with smallest Q-value at that belief state. Thus,
Bellman’s principle instructs us to minimize a modified cost function ($Q_H$) that includes the term
$\mathrm{E}[J_{H-1}^*(b_1) \mid b_0, a]$ indicating the expected future cost of an action; this term is called the expected
cost-to-go (ECTG). By minimizing the Q-value that includes the ECTG, the resulting policy has a
lookahead property that is a common theme among POMDP solution approaches.
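
The recursion behind the Q-value can be illustrated with a short Python sketch of finite-horizon dynamic programming. To keep it runnable, a small finite, fully observable model stands in for the belief-state process; this substitution, the function name backward_induction, and the example numbers are assumptions for illustration and are not part of the report's method.

```python
def backward_induction(cost, trans, horizon):
    """Finite-horizon dynamic programming on a finite model (illustrative).

    cost[s][a]     : one-step cost of action a in state s
    trans[s][a][t] : probability of moving from s to t under a
    Returns (J, policy), where J[h][s] is the optimal cost with h steps
    remaining and policy[h][s] is a minimizing action (policy[0] is unused).
    """
    n_states = len(cost)
    n_actions = len(cost[0])
    J = [[0.0] * n_states]             # zero steps remaining costs nothing
    policy = [[None] * n_states]
    for h in range(1, horizon + 1):
        J_h, pi_h = [], []
        for s in range(n_states):
            # Q-value: immediate cost plus expected cost-to-go.
            q = [cost[s][a] + sum(trans[s][a][t] * J[h - 1][t]
                                  for t in range(n_states))
                 for a in range(n_actions)]
            best = min(range(n_actions), key=q.__getitem__)
            J_h.append(q[best])
            pi_h.append(best)
        J.append(J_h)
        policy.append(pi_h)
    return J, policy


# Hypothetical example: in state 0, action 0 is cheap now but keeps paying,
# while action 1 costs more now and moves to the cost-free state 1.
cost = [[1.0, 2.0], [0.0, 0.0]]
trans = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [0.0, 1.0]]]
J, policy = backward_induction(cost, trans, horizon=3)
```

In this example the one-step (myopic) choice in state 0 is action 0, whereas with three steps remaining the minimizer of the Q-value is action 1, which is the lookahead effect described above.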

For the optimal action at the next belief state $b_1$, we would similarly define the Q-value

$$Q_{H-1}(b_1, a) = c(b_1, a) + \mathrm{E}[J_{H-2}^*(b_2) \mid b_1, a],$$

where $b_2$ is the random next belief state and $J_{H-2}^*(b_2)$ is the optimal cumulative cost over the time
horizon $2, \ldots, H$ starting with belief state $b_2$. Bellman’s principle then states that the optimal
action is given by

$$\pi_1^*(b_1) = \arg\min_a Q_{H-1}(b_1, a).$$

A common approach in on-line optimization-based control is to assume that the horizon is long
enough that the difference between $Q_H$ and $Q_{H-1}$ is negligible. This has two implications: first, the
time-varying optimal policy $\pi_k^*$ may be approximated by a stationary policy, denoted $\pi^*$; second,
the optimal policy is given by

$$\pi^*(b) = \arg\min_a Q_H(b, a),$$

where now the horizon is fixed at $H$ regardless of the current time $k$. This approach is called receding
horizon control, and is practically appealing because it provides lookahead capability without the
technical difficulty of infinite-horizon control. Moreover, there is usually a practical limit to how
far models may be usefully predicted. Henceforth we will assume the horizon length is constant
and drop it from our notation.
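
The receding-horizon idea (plan over a fixed horizon, apply only the first action, then re-plan) can be sketched in Python as below. The toy system is deterministic with a small finite action set so that the planner can enumerate all action sequences; these simplifications, and the names plan and receding_horizon, are assumptions for illustration rather than the report's stochastic formulation.

```python
from itertools import product


def plan(x, horizon, actions, step, cost):
    """Return the first action of the cheapest open-loop action sequence."""
    best_seq, best_total = None, float("inf")
    for seq in product(actions, repeat=horizon):
        state, total = x, 0.0
        for a in seq:
            total += cost(state, a)
            state = step(state, a)
        if total < best_total:
            best_seq, best_total = seq, total
    return best_seq[0]


def receding_horizon(x0, n_steps, horizon, actions, step, cost):
    """Re-plan over a fixed horizon at every step; apply only the first action."""
    x, trajectory = x0, [x0]
    for _ in range(n_steps):
        a = plan(x, horizon, actions, step, cost)
        x = step(x, a)
        trajectory.append(x)
    return trajectory


# Hypothetical scalar system: move by -1, 0, or +1; pay the squared next state.
traj = receding_horizon(
    x0=3, n_steps=5, horizon=3, actions=(-1, 0, 1),
    step=lambda x, a: x + a,
    cost=lambda x, a: (x + a) ** 2,
)
```

The planning horizon stays at three steps throughout the run, independent of the current time, and the state is driven to the origin and held there.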

In summary, we seek a policy $\pi^*(b)$ that, for a given belief state $b$, returns the action $a$ that
minimizes $Q(b, a)$, which in the receding-horizon case is

$$Q(b, a) = c(b, a) + \mathrm{E}[J^*(b') \mid b, a],$$

where $b'$ is the (random) belief state after applying action $a$ at belief state $b$, and $c(b, a)$ is the
associated cost. The second term in the Q-value is in general difficult to obtain, especially because
the belief-state space is large. For this reason, approximation methods are necessary. In the next
section, we describe our algorithm for approximating $\arg\min_a Q(b, a)$.

We should re-emphasize here that the action space in our UAV guidance problem is a hypercube,
which is a continuous space of possible actions. The optimization involved in performing
$\arg\min_a Q(b, a)$ therefore involves a search algorithm over this hypercube. The focus of our
research is on a new method to approximate $Q(b, a)$ and not on how to minimize it. Therefore, we
simply use a generic search method to perform the minimization. More specifically, in our
simulation studies, we used Matlab’s fmincon function. We should point out that in related work, other
authors have considered the problem of designing a good search algorithm (e.g., [33]).
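
For readers without access to Matlab, the following Python sketch shows a very simple derivative-free search over the box $[-1, 1]^n$. It is a stand-in written for illustration (a coordinate pattern search), not the algorithm used by fmincon and not the method used in the report; the function name compass_search and the quadratic test objective are assumptions.

```python
def compass_search(q, a0, step=0.5, tol=1e-6, lo=-1.0, hi=1.0):
    """Minimize q over the box [lo, hi]^n by coordinate pattern search."""
    a = [min(hi, max(lo, v)) for v in a0]
    best = q(a)
    while step > tol:
        improved = False
        for i in range(len(a)):
            for delta in (step, -step):
                trial = list(a)
                trial[i] = min(hi, max(lo, a[i] + delta))  # stay in the box
                value = q(trial)
                if value < best:
                    a, best, improved = trial, value, True
        if not improved:
            step *= 0.5     # shrink the pattern when no move helps
    return a, best


# Hypothetical objective whose unconstrained minimum (0.3, -2) lies outside
# the box, so the constrained minimum is on the boundary at (0.3, -1).
a_star, q_star = compass_search(
    lambda a: (a[0] - 0.3) ** 2 + (a[1] + 2.0) ** 2, [0.0, 0.0])
```

A local search of this kind only finds a local minimum in general, so in practice it would typically be restarted from several initial points or replaced by a more capable constrained optimizer.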

5  Approximation  Method 

There are two aspects of a general POMDP that make it intractable to solve exactly. First, it
is a stochastic control problem, so the dynamics are properly understood as constraints on
distributions over the state space, which are infinite-dimensional in the case of a continuous state
space as in our tracking application. In practice, solution methods for Markov decision processes
employ some parametric representation or nonparametric (i.e., Monte Carlo or “particle”)
representation of the distribution, to reduce the problem to a finite-dimensional one. Intelligent choices
of finite-dimensional approximations are derived from Bellman’s principle characterizing the
optimal solution. POMDPs, however, have the additional complication that the state space itself is
infinite-dimensional, since it includes the belief state, which is a distribution; hence, the belief state
must also be approximated by some finite-dimensional representation. In Section 5.1 we present a
finite-dimensional approximation to the problem called nominal belief-state optimization (NBO),
which takes advantage of the particular structure of the tracking objective in our application.

Secondly,  in  the  interest  of  long-term  performance,  the  objective  of  a  POMDP  is  often  stated 
over  an  arbitrarily  long  or  infinite  horizon.  This  difficulty  is  typically  addressed  by  truncating  the 
horizon  to  a  finite  length,  the  effect  of  which  is  discussed  in  Section  5.2. 

Before proceeding to the detailed description of our NBO approach, we first make two
simplifying approximations that follow from standard assumptions for tracking problems. The first
approximation, which follows from the assumption of a correct tracking model and Gaussian
statistics, is that the belief-state component for the target can be expressed as

$$b_k^\zeta(\zeta) = \mathcal{N}(\zeta - \xi_k, P_k) \tag{5.1}$$

and can be updated using (extended) Kalman filtering. We adopt this approximation for the
remainder of this report. The second approximation, which follows from the additional assumption
of correct data association, is that the cost function can be written as

$$c(b_k, a_k) = \int \mathrm{E}_{v_k, w_{k+1}}\left[ \|\zeta_{k+1} - \xi_{k+1}\|^2 \mid s_k, \zeta, \xi_k, a_k \right] b_k^\zeta(\zeta)\, d\zeta = \operatorname{Tr} P_{k+1}. \tag{5.2}$$

In Section 8, we study the impact of this approximation in the context of tracking with data
association ambiguity (i.e., when we do not necessarily have the correct data association), and
consider a different cost function that explicitly takes into account the data association ambiguity.
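
The trace cost in (5.2) can be evaluated from the Kalman covariance recursion alone. The Python sketch below does this for a deliberately reduced case: a 2-D position-only track with identity dynamics and a direct position measurement. Those reductions, the helper names, and the numbers are assumptions for illustration; the report's tracks also carry velocity (and sometimes acceleration) components.

```python
def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_inv(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


def trace_cost(P, Q, R, visible):
    """Trace of the next posterior covariance for a 2-D position-only track.

    P: current posterior covariance, Q: process noise covariance,
    R: measurement covariance, visible: whether a measurement is received.
    """
    P_pred = mat_add(P, Q)                      # prediction with F = I
    if not visible:
        P_next = P_pred                         # occluded: no correction
    else:
        S_inv = mat_inv(mat_add(P_pred, R))     # innovation covariance, H = I
        reduction = mat_mul(mat_mul(P_pred, S_inv), P_pred)
        P_next = [[P_pred[i][j] - reduction[i][j] for j in range(2)]
                  for i in range(2)]
    return P_next[0][0] + P_next[1][1]


# Hypothetical covariances.
I2 = [[1.0, 0.0], [0.0, 1.0]]
R2 = [[2.0, 0.0], [0.0, 2.0]]
seen = trace_cost(I2, I2, R2, visible=True)
hidden = trace_cost(I2, I2, R2, visible=False)
```

For a linear Kalman filter this recursion does not depend on the measured values, only on the covariances and on whether a measurement arrives, so the cost of a candidate sensor position can be computed before any measurement is taken.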


5.1  Nominal  Belief-State  Optimization  (NBO) 

A  number  of  POMDP  approximation  methods  have  been  studied  in  the  literature.  It  is  instructive 
to  review  these  methods  briefly,  to  provide  some  context  for  our  NBO  approach.  These  methods 
either directly approximate the Q-value $Q(b, a)$ or indirectly approximate the Q-value by
approximating the cost-to-go $J^*(b)$, and include heuristic expected cost-to-go (ECTG) [19], parametric
approximation [4, 38], policy rollout [3], hindsight optimization [8, 42], and foresight optimization
(also called open loop feedback control (OLFC)) [2]. The following is a summary of these methods,
exposing  the  nature  of  each  approximation  (for  a  detailed  discussion  of  these  methods  applied  to 
sensor  resource  management  problems,  see  [9]): 


    heuristic expected cost-to-go (ECTG)          $Q(b, a) \approx c(b, a) + \gamma N(b, a)$
    parametric approximation (e.g., Q-learning)   $Q(b, a) \approx \tilde{Q}(b, a, \theta)$
    policy rollout                                $Q(b, a) \approx c(b, a) + \mathrm{E}[ J^{\pi_{\mathrm{base}}}(b') \mid b, a ]$
    hindsight optimization                        $J^*(b) \approx \mathrm{E}[ \min_{(a_k)_k} \sum_k c(b_k, a_k) \mid b ]$
    foresight optimization (OLFC)                 $J^*(b) \approx \min_{(a_k)_k} \mathrm{E}[ \sum_k c(b_k, a_k) \mid b, (a_k)_k ]$

The notation $(a_k)_k$ means the ordered list $(a_0, a_1, \ldots)$. Typically, the expectations in the last three
methods are approximated using Monte Carlo methods.

The  NBO  approach  may  be  summarized  as 

$$ J^*(b) \approx \min_{(a_k)_k} \sum_k c(\hat{b}_k, a_k), \tag{5.3} $$

where $(\hat{b}_k)_k$ represents a nominal sequence of belief states. Thus, it resembles both the hindsight
and foresight optimization approaches, but with the expectation approximated by one sample.
The reader will notice that hindsight and foresight optimization differ in the order in which the
expectation and minimization are taken. However, because NBO involves only a single sample
path (instead of an expectation), NBO straddles this distinction between hindsight and foresight
optimization. 

The  central  motivation  behind  NBO  is  computational  efficiency.  If  one  cannot  afford  to  simulate 
multiple  samples  of  the  random  noise  sequences  to  estimate  expectations,  and  only  one  realization 
can be chosen, it is natural to choose the “nominal” sequence (e.g., maximum likelihood or mean).
The nominal noise sequence leads to a nominal belief-state sequence $(\hat{b}_k)_k$ as a function of the
chosen action sequence $(a_k)_k$. Note that in NBO, as in foresight optimization, the optimization is
over a fixed sequence $(a_k)_k$ rather than a noise-dependent sequence or a policy.

There are two points worth emphasizing about the NBO approach. First, the nominal belief-state
sequence is not fixed, as (5.3) might suggest; rather, the underlying random variables are
fixed at nominal values and the belief states become deterministic functions of the chosen actions.
Second, the expectation implicit in the incremental cost $c(b_k, a_k)$ (recall (4.1) and (4.3)) need not
be  approximated  by  the  “nominal”  value.  In  fact,  for  the  mean-squared-error  cost  we  use  in  the 
tracking  application,  the  nominal  value  would  be  0.  Instead,  we  use  the  fact  that  the  expected  cost 
can  be  evaluated  analytically  by  (5.2)  under  the  previously  stated  assumptions  of  correct  tracking 
model,  Gaussian  statistics,  and  correct  data  association. 

Because  NBO  approximates  the  belief-state  evolution  but  not  the  cost  evaluation,  the  method 
is suitable when the primary effect of the randomness appears in the cost, not in the state
prediction. Thus, NBO should perform well in our tracking application as long as the target motion is
reasonably  predictable  with  the  tracking  model  within  the  chosen  planning  horizon. 

The  general  procedure  for  using  the  NBO  approximation  may  be  summarized  as  follows: 

1.  Write  the  state  dynamics  as  functions  of  zero-mean  noise.  For  example,  borrowing  from  the 
notation  of  Section  4.2: 


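$$ x_{k+1} = f(x_k, a_k) + v_k, \quad v_k \sim \mathcal{N}(0, Q_k) $$
$$ z_k = g(x_k) + w_k, \quad w_k \sim \mathcal{N}(0, R_k). $$

2. Define the nominal belief-state sequence $(\hat{b}_1, \ldots, \hat{b}_{H-1})$:

$$ b_{k+1} = \Phi(b_k, a_k, v_k, w_{k+1}) \;\Rightarrow\; \hat{b}_{k+1} = \Phi(\hat{b}_k, a_k, 0, 0) $$
$$ \hat{b}_0 = b_0 $$

In the linear Gaussian case, this is the MAP estimate of $b_k$.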

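3. Replace the expectation over random future belief states,

$$ J_H(b_0) = \mathrm{E}_{b_1, \ldots, b_H} \left[ \sum_{k=1}^{H} c(b_k, a_k) \right], $$

with the sample given by the nominal belief-state sequence:

$$ J_H(b_0) \approx \sum_{k=1}^{H} c(\hat{b}_k, a_k). \tag{5.4} $$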


4. Optimize over the action sequence $(a_0, \ldots, a_{H-1})$.

As pointed out before, because our focus here is to introduce NBO as a new approximation method,
the  optimization  in  the  last  step  above  is  taken  to  be  a  generic  optimization  problem  that  is  solved 
using  a  generic  method.  In  our  simulation  studies,  we  used  Matlab’s  fmincon  function. 
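
The four-step procedure above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the report: a scalar linear-Gaussian target, a sensor moving on a line with a three-element action set, a measurement noise variance that grows quadratically with range, and an exhaustive search standing in for the generic optimizer. All names and constants (`F`, `Q`, `RHO`, `R0`, `ACTIONS`, `nominal_step`, `nbo_cost`, `nbo_plan`) are hypothetical.

```python
import itertools

# Hypothetical scalar model (not from the report): the target state evolves as
# zeta' = F*zeta + v, v ~ N(0, Q); a sensor at position s measures it with
# noise variance R = RHO*(s - zeta)**2 + R0, so closer sensors measure better.
F, Q, RHO, R0 = 1.0, 0.5, 0.2, 0.1
ACTIONS = (-1.0, 0.0, 1.0)  # sensor displacement per step


def nominal_step(belief, sensor, action):
    """Step 2: propagate the belief with all noise terms set to zero."""
    mean, var = belief
    sensor = sensor + action
    mean_pred = F * mean                      # zero process noise
    var_pred = F * var * F + Q
    r = RHO * (sensor - mean_pred) ** 2 + R0  # noise evaluated at nominal state
    var_post = 1.0 / (1.0 / var_pred + 1.0 / r)
    # Zero measurement noise means a zero innovation, so the mean is unchanged.
    return (mean_pred, var_post), sensor


def nbo_cost(belief, sensor, actions):
    """Step 3: sum the analytic incremental cost along the nominal sequence."""
    total = 0.0
    for a in actions:
        belief, sensor = nominal_step(belief, sensor, a)
        total += belief[1]  # scalar analogue of the covariance trace
    return total


def nbo_plan(belief, sensor, horizon):
    """Step 4: optimize over fixed action sequences (exhaustive search here)."""
    return min(itertools.product(ACTIONS, repeat=horizon),
               key=lambda seq: nbo_cost(belief, sensor, seq))


plan = nbo_plan(belief=(3.0, 1.0), sensor=0.0, horizon=3)
```

In this toy setting the planner drives the sensor toward the nominal target position, because the nominal variance depends on the actions only through the range-dependent measurement noise. In a receding-horizon loop only the first element of `plan` would be issued before re-planning.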

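In the specific case of tracking, recall that the belief state corresponding to the target state
$\zeta_k$ is identified with the track state $(\xi_k, P_k)$ according to (5.1). Therefore, the nominal belief state
$\hat{b}_k^{\zeta}$ evolves according to the nominal track state trajectory $(\hat{\xi}_k, \hat{P}_k)$ given by the (extended) Kalman
filter equations with an exactly zero noise sequence. This reduces to

$$ \hat{b}_k^{\zeta}(\zeta) = \mathcal{N}(\zeta - \hat{\xi}_k, \hat{P}_k) $$
$$ \hat{\xi}_{k+1} = F_k \hat{\xi}_k $$
$$ \hat{P}_{k+1} = \left[ (F_k \hat{P}_k F_k^T + Q_k)^{-1} + H_{k+1}^T \left[ R_{k+1}(\hat{\xi}_{k+1}, s_{k+1}) \right]^{-1} H_{k+1} \right]^{-1}, $$

where the (linearized) target motion model is given by

$$ \zeta_{k+1} = F_k \zeta_k + v_k, \quad v_k \sim \mathcal{N}(0, Q_k) $$
$$ z_k = H_k \zeta_k + w_k, \quad w_k \sim \mathcal{N}(0, R_k(\zeta_k, s_k)). $$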


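The incremental cost given by the nominal belief state is then

$$ c(\hat{b}_k, a_k) = \operatorname{Tr} \hat{P}_{k+1} = \sum_{i=1}^{N_{\mathrm{targ}}} \operatorname{Tr} \hat{P}_{k+1}^{i}, $$

where $N_{\mathrm{targ}}$ is the number of targets.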
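
As a quick numerical check of the covariance recursion in this section, consider a scalar case with illustrative values that are not taken from the report: $F_k = H_{k+1} = 1$, $\hat{P}_k = 1$, $Q_k = 0.5$, and $R_{k+1} = 0.1$. The information form used above and the standard gain form of the Kalman update give the same nominal variance:

```latex
\begin{align*}
P^- &= F_k \hat{P}_k F_k^T + Q_k = 1.5, \\
\hat{P}_{k+1}
  &= \left[ (P^-)^{-1} + H_{k+1}^T R_{k+1}^{-1} H_{k+1} \right]^{-1}
   = \left[ \tfrac{1}{1.5} + \tfrac{1}{0.1} \right]^{-1}
   = \tfrac{3}{32} \approx 0.094, \\
\hat{P}_{k+1}
  &= P^- - \frac{(P^-)^2}{P^- + R_{k+1}}
   = 1.5 - \frac{2.25}{1.6}
   = \tfrac{3}{32} \approx 0.094 .
\end{align*}
```

No measurement value appears in either form, which is why the nominal covariance, and hence the incremental cost, can be predicted as a deterministic function of the planned sensor positions.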


5.2  Finite  Horizon 


In the guidance problem we are interested in long-term tracking performance. For the sake of exposition,
if we idealize this problem as an infinite-horizon POMDP (ignoring the attendant technical
complications), Bellman’s principle can be stated as

$$ J_\infty^*(b) = \min_{\pi} \mathrm{E} \left[ \sum_{k=0}^{H-1} c(b_k, \pi(b_k)) + J_\infty^*(b_H) \right] \tag{5.5} $$

for any $H < \infty$. The term $\mathrm{E}[J_\infty^*(b_H)]$ is the expected cost to go (ECTG) from the end of the
horizon H. If H represents the practical limit of horizon length, then (5.5) may be approximated
in  two  ways: 


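$$ J_\infty^*(b) \approx \min_{\pi} \mathrm{E} \left[ \sum_{k=0}^{H-1} c(b_k, \pi(b_k)) \right] \qquad \text{(truncation)} $$
$$ J_\infty^*(b) \approx \min_{\pi} \mathrm{E} \left[ \sum_{k=0}^{H-1} c(b_k, \pi(b_k)) + \hat{J}(b_H) \right] \qquad \text{(HECTG)}. $$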


The  first  amounts  to  ignoring  the  ECTG  term,  and  is  often  the  approach  taken  in  the  literature.  The 
second  replaces  the  exact  ECTG  with  a  heuristic  approximation,  typically  a  gross  approximation 
that  is  quick  to  compute.  To  benefit  from  the  inclusion  of  a  heuristic  ECTG  (HECTG)  term  in  the 
cost function for optimization, $\hat{J}$ need only be a better estimate of $J_\infty^*$ than a constant. Moreover,
the utility of the approximation lies in how well it ranks actions, not in how well it estimates the
ECTG.  Section  6.4  will  illustrate  the  crucial  role  this  term  can  play  in  generating  a  good  action 
policy. 
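
The point that ranking matters more than accuracy can be illustrated with a small Python sketch. The action names and all numbers below are hypothetical and chosen only for illustration: the truncated objective prefers the wrong action, while a crude heuristic that badly underestimates the true cost to go still restores the correct ordering.

```python
def rank_actions(stage_cost, ctg_estimate=None):
    """Order candidate actions by stage cost plus an optional terminal estimate."""
    def score(action):
        extra = ctg_estimate[action] if ctg_estimate is not None else 0.0
        return stage_cost[action] + extra
    return sorted(stage_cost, key=score)


# Hypothetical values: cost accumulated within the horizon for two actions.
stage_cost = {"stay": 1.0, "cross_gap": 1.2}
# Hypothetical true cost to go beyond the horizon (unknown in practice).
true_ctg = {"stay": 9.0, "cross_gap": 4.0}
# Crude heuristic: far from the true values, but ordered the same way.
heuristic_ctg = {"stay": 2.0, "cross_gap": 0.5}

truncated_ranking = rank_actions(stage_cost)
hectg_ranking = rank_actions(stage_cost, heuristic_ctg)
ideal_ranking = rank_actions(stage_cost, true_ctg)
```

Here `hectg_ranking` agrees with `ideal_ranking` even though the heuristic values are several times smaller than the true ones, whereas `truncated_ranking` puts "stay" first.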

6  Single UAV Case

We begin our assessment of the performance of a POMDP-based design with the simple case of a
single  UAV  and  two  targets,  where  the  two  targets  move  along  parallel  straight-line  paths.  This  is 
enough  to  demonstrate  the  qualitative  behavior  of  the  method.  It  turns  out  that  a  straightforward 
but  naive  implementation  of  the  POMDP  approach  leads  to  performance  problems,  but  these  can 
be  overcome  by  employing  an  approximate  expected  cost-to-go  (ECTG)  term  in  the  objective,  and 
a  two-phase  approach  for  the  action  search. 

6.1  Scenario  Trajectory  Plots 

First we describe what is depicted in the scenario trajectory plots that appear throughout the
remaining sections. See, for example, Figures 6.1 and 6.2. Target location at each measurement
time is indicated by a small red dot. The targets in most scenarios move in straight horizontal
lines from left to right at constant speed. The track covariances are indicated by blue ellipses at
each measurement time; these are 1-sigma ellipses corresponding to the position component of the
covariances, centered at the mean track position indicated by a black dot. (However, this coloring
scheme is modified in later sections in order to better distinguish between closely spaced targets.)

The UAV trajectory is plotted as a thin black line, with arrows at periodic intervals. Large X’s,
synchronized with the arrows on the UAV trajectory, appear on the tracks to give a sense
of relative positions at any time.

Finally,  occlusions  are  indicated  by  thick  light  green  lines.  When  the  line  of  sight  from  a  sensor 
to  a  target  intersects  an  occlusion,  that  target  is  not  visible  from  that  sensor.  This  is  a  crude 
model  of  buildings  or  walls  that  block  the  visibility  of  certain  areas  of  the  ground  from  different 
perspectives.  It  is  not  meant  to  be  realistic,  but  serves  to  illustrate  the  effect  of  occlusions  on  the 
performance  of  the  UAV  guidance  algorithm. 

6.2  Results  with  no  ECTG 

Following the NBO procedure, our first design for guiding the UAV optimizes the cost function
(5.4) within a receding horizon approach, issuing only the command $a_0$ and reoptimizing at the
next step. In the simplest case, the policy is a myopic one: choose the next action that minimizes
the immediate cost at the next step based on current state information. This is equivalent to a
receding horizon approach with H = 1 and no ECTG term. The behavior of this policy in a scenario
with two targets moving at constant velocity along parallel paths is illustrated in Figure 6.1. For
this  scenario,  the  behavior  with  H  >  1  (applying  NBO)  is  not  qualitatively  different.  The  UAV’s 
speed  is  greater  than  the  targets’,  so  the  UAV  is  forced  to  loop  or  weave  to  reduce  its  average  speed. 
Moreover,  the  UAV  tends  to  fly  over  one  target  then  the  other,  instead  of  staying  in  between.  There 
are  two  main  reasons  for  this.  First,  the  measurement  noise  is  non-isotropic,  so  it  is  beneficial  to 
observe  the  targets  from  different  angles  over  time.  Second,  the  trace  objective  is  minimized  by 
locating  the  UAV  over  the  target  with  the  greater  covariance  trace. 

Figure 6.1: No occlusion with H = 1

Figure 6.2: Gap occlusion with H = 1

To see this, consider a simplified one-dimensional tracking problem with stationary targets on
the real line with positions $x_1$ and $x_2$, sensor position $y$, and noisy measurements of target positions
given by

$$ z_i \sim \mathcal{N}(x_i, \rho (y - x_i)^2 + r), \quad i = 1, 2. $$

This noise model is analogous to the relative range uncertainty defined in Section 4.2. If the current
“track” variances are given by $p_1$ and $p_2$, then the variances after updating with the Kalman filter,
as a function of the new sensor location $y$, are given by

$$ p_i^+(y) = (1 - k_i) \, p_i = \frac{\rho (y - x_i)^2 + r}{\rho (y - x_i)^2 + r + p_i} \, p_i, \quad i = 1, 2, $$

and the trace of the overall (diagonal) covariance is $c(y) = p_1^+(y) + p_2^+(y)$. It is not hard to show
that if the targets are separated enough, $c(y)$ has local minima at about $y = x_1$ and $y = x_2$ with
values of approximately $p_2 + p_1 r / (p_1 + r)$ and $p_1 + p_2 r / (p_2 + r)$, respectively. Therefore, the best
location of the sensor is at about $x_1$ if $p_1 > p_2$, and at about $x_2$ if the opposite is true.
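
The claim about the local minima can be checked numerically. The Python sketch below evaluates the post-update trace on a grid for one illustrative configuration; the numerical values of the target positions, prior variances, and noise parameters are assumptions chosen for the example and do not come from the report.

```python
def posterior_trace(y, x=(0.0, 10.0), p=(2.0, 1.0), rho=1.0, r=0.1):
    """Sum of the two scalar Kalman-updated variances for sensor position y."""
    total = 0.0
    for x_i, p_i in zip(x, p):
        noise = rho * (y - x_i) ** 2 + r       # range-dependent measurement variance
        total += p_i * noise / (noise + p_i)   # scalar Kalman variance update
    return total


# Evaluate on a grid from -2 to 12 in steps of 0.01.
grid = [i * 0.01 for i in range(-200, 1201)]
costs = [posterior_trace(y) for y in grid]

# Interior grid points that are strictly lower than both neighbours.
local_minima = [grid[i] for i in range(1, len(grid) - 1)
                if costs[i] < costs[i - 1] and costs[i] < costs[i + 1]]

# Global minimizer on the grid.
best = min(grid, key=posterior_trace)
```

With these values the prior variance of the first target is the larger one, so the global minimizer sits next to the first target, while a second, shallower local minimum remains next to the second target. This second minimum is what can trap a gradient-based search.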

Thus, the simple myopic policy behaves in a nearly optimal manner when there are no occlusions.
However, if occlusions are introduced, some lookahead (e.g., a longer planning horizon) is necessary to
anticipate the loss of observations. Figure 6.2 illustrates what happens when the planning horizon
is too short. In this scenario, there are two horizontal walls with a gap separating them. If the
UAV cannot cross the gap within the planning horizon, there is no apparent benefit to moving
away from the top target toward the bottom target, and the track on the bottom target goes stale.
On the other hand, with H = 4 the horizon is long enough to realize the benefit of crossing the
gap, and the weaving behavior is recovered (see Figure 6.3).

Figure 6.3: Gap occlusion with H = 4

Figure 6.4: Gap occlusion with H = 4, search initialized with H = 1 plan

In addition to the length of the planning horizon, another factor that can be important in
practical performance is the initialization of the search for the action sequence. The result of the
policy of initializing the four-step action sequence with the output of the myopic plan (H = 1)
is shown in Figure 6.4. The search fails to overcome the poor performance of the myopic plan
because the search starts near a local minimum (recall that the trace objective has local minima in
the neighborhood of each target). Bellman's principle depends on finding the global minimum, but
our search is conducted with a gradient-based algorithm (Matlab’s fmincon function), which is
susceptible to local minima. One remedy is to use a more reliable but expensive global optimization
algorithm. Another remedy, the one we chose, is to use a more intelligent initialization for the
search, using a penalty term described in the next section.

6.3  Weighted  Trace  Penalty 

The performance failures illustrated in the previous section are due to the lack of sensitivity in our
finite-horizon  objective  function  (5.4)  to  the  cost  of  not  observing  a  target.  When  the  horizon  is  too 
short,  it  seems  futile  to  move  toward  an  unobserved  target  if  no  observations  can  be  made  within 
the  horizon.  Likewise,  if  the  action  plan  required  to  make  an  observation  on  an  occluded  target 
deviates  far  enough  from  the  initial  plan,  it  may  not  be  found  by  a  local  search  because  locally  there 
is  no  benefit  to  moving  toward  the  occluded  target.  To  produce  a  solution  closer  to  the  optimal 
infinite-horizon  policy,  the  benefit  of  initial  actions  that  move  the  UAV  closer  to  occluded  targets 
must  be  exposed  somehow. 

One way to expose that benefit is to augment the cost function with a term that explicitly
rewards actions that bring the UAV closer to observing an occluded target. However, such
modifications must be used with caution. The danger of simply optimizing a heuristically modified cost
function is that the heuristics may not apply well in all situations. Bellman’s principle informs
us of the proper mechanism to include a term modeling a “hidden” long-term cost: the expected
cost-to-go (ECTG) term. Indeed, the blame for poor performance may be placed on the use of
truncation rather than HECTG as the finite-horizon approximation to the infinite-horizon cost (see
Section 5.2).

In  our  tracking  application,  the  hidden  cost  is  the  growth  of  the  covariance  of  the  track  on  an 
occluded  target  while  it  remains  occluded.  We  estimate  this  growth  by  a  weighted  trace  penalty 
(WTP)  term,  which  is  a  product  of  the  current  covariance  trace  and  the  minimum  distance  to 
observability (MDO) for a currently occluded target, a term we define precisely below. With the
UAV moving at a constant speed, this is roughly equivalent to a scaling of the trace by the time
it takes to observe the target. When combined with the trace term that is already in the cost
function, this amounts to an approximation of the track covariance at the time the target is finally
observed. More accurate approximations are certainly possible, but this simple approximation is
sufficient to achieve the desired effect.

Figure 6.5: Minimum distance to observability

Specifically, the terminal cost or ECTG term using the WTP has the form

$$ \hat{J}(b) = J_{\mathrm{WTP}}(b) := \gamma \, D(s, \xi^i) \operatorname{Tr} P^i, \tag{6.1} $$

where $\gamma$ is a positive constant, $i$ is the index of the worst occluded target,

$$ i = \arg\max_{i' \in I} \operatorname{Tr} P^{i'}, \qquad I = \{ i' \mid \xi^{i'} \text{ invisible from } s \}, $$

and $D(s, \xi)$ is the MDO, i.e., the distance from the sensor location given by $s$ to the closest point
$p^{\mathrm{MDO}}(s, \xi)$ from which the target location given by $\xi$ is observable. Figure 6.5 is a simple illustration
of the MDO concept. Given a single rectangular occlusion, $p^{\mathrm{MDO}}(s, \xi)$ and $D(s, \xi)$ can be found
very easily. Given multiple rectangular occlusions, the exact MDO is cumbersome to compute, so
we use a fast approximation instead. For each rectangular occlusion $j$, we compute $p_j^{\mathrm{MDO}}(s, \xi)$ and
$D_j(s, \xi)$ as if $j$ were the only occlusion. Then we have $D(s, \xi) \ge \max_j D_j(s, \xi) > 0$ whenever $\xi$ is
occluded from $s$, so we use $\max_j D_j(s, \xi)$ as a generally suitable approximation to $D(s, \xi)$.

The  reason  a  worst-case  among  the  occluded  targets  is  selected,  rather  than  including  a  term 
for  each  occluded  target,  is  that  this  forces  the  UAV  to  at  least  obtain  an  observation  on  one  target 
instead  of  being  pulled  toward  two  separate  targets  and  possibly  never  observing  either  one.  The 
true  ECTG  certainly  includes  costs  for  all  occluded  targets.  However,  given  that  the  ECTG  can 
only  be  approximated,  the  quality  of  the  approximation  is  ultimately  judged  by  whether  it  leads 
to the correct ranking of action plans within the horizon, and not by whether it closely models the
true ECTG value. We claim that by applying the penalty to only the worst track covariance, the
chosen actions are closer to the optimal policy than what would result by applying the penalty to
all  occluded  tracks. 
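
The WTP term of this section reduces to a few lines once the per-occlusion distances are available. The Python sketch below assumes those distances have already been computed by some geometric routine that is not shown; the dictionary layout, the field names, and the choice to return zero when nothing is occluded are all assumptions made for illustration.

```python
def wtp_penalty(targets, gamma):
    """Weighted trace penalty for one sensor.

    Each target is a dict with hypothetical fields:
      'trace'             covariance trace of the track,
      'visible'           True if the target is visible from the sensor,
      'per_occlusion_mdo' distances D_j computed one occlusion at a time.
    """
    occluded = [t for t in targets if not t["visible"]]
    if not occluded:
        return 0.0  # assumption: no occluded target means no penalty
    # Only the occluded target with the largest covariance trace is penalized.
    worst = max(occluded, key=lambda t: t["trace"])
    # Approximate the true MDO by the largest single-occlusion distance.
    mdo = max(worst["per_occlusion_mdo"])
    return gamma * mdo * worst["trace"]


targets = [
    {"trace": 4.0, "visible": True, "per_occlusion_mdo": [0.0]},
    {"trace": 2.0, "visible": False, "per_occlusion_mdo": [3.0, 5.0]},
    {"trace": 3.0, "visible": False, "per_occlusion_mdo": [1.0, 0.0]},
]
penalty = wtp_penalty(targets, gamma=0.5)
```

In this example the visible target has the largest trace but is ignored, and of the two occluded targets only the one with trace 3.0 contributes, even though the other is farther from observability.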

6.4  Results  with  WTP  for  ECTG 

Let WTP(H) denote the procedure of optimizing the NBO cost function with horizon length H
plus the WTP estimate of the ECTG:

$$ \min \sum_{k=0}^{H-1} c(\hat{b}_k, a_k) + J_{\mathrm{WTP}}(\hat{b}_H). \tag{6.2} $$

Figure 6.6: Behavior of WTP(1)

Initially, we consider the use of WTP(1) in two different roles: adapting the horizon length and
initializing the action search. Subsequently, we consider the effect of the terminal cost in WTP(H)
with H > 1.

Figure 6.6 shows the behavior of WTP(1) on the gap scenario previously considered, using
a penalty weight of just $\gamma = 10^{-6}$. Comparing with Figure 6.2, which has the same horizon
length  but  no  penalty  term,  we  see  that  the  WTP  has  the  desired  effect  of  forcing  the  UAV  to 
alternately visit each target. Therefore, the output of WTP(1) is a reasonable starting point for
predicting the trajectory arising from a good action plan. Since WTP(1) is really a form of Q-value
approximation  (namely  the  heuristic  ECTG  approach  mentioned  in  the  beginning  of  Section  5.1), 
it  is  not  surprising  that  it  generates  a  nonmyopic  policy  that  outperforms  the  myopic  policy,  even 
though  both  policies  evaluate  the  incremental  cost  c  at  only  one  step. 

By playing out a sequence of applications of WTP(1), which amounts to a sequence of one-dimensional
optimizations, we can quickly generate a prediction of sensor motion that is useful for
adapting the planning horizon and initializing the multi-step action search, potentially mitigating
the effects seen in Figures 6.2 and 6.4. Thus, we use a three-step algorithm described as follows:

1. Generate an initial action plan by a sequence of $H_{\max}$ applications of WTP(1).

2. Choose H to be the minimum number of steps such that there is no change in observability
of any of the targets after that time, with a minimum value of $H_{\min}$.

3. Search for the optimal H-step action sequence, starting at the initial plan generated in step 1.

This can be considered a two-phase approach, with the first two steps constituting Phase I and the
third step being Phase II. The heuristic role of WTP(1) in the above algorithm is appropriate in
the POMDP framework, because any suboptimal behavior caused by the heuristic in Phase I has
a chance of being corrected by the optimization over the longer horizon in Phase II, provided $H_{\min}$
and $H_{\max}$ are large enough. Figure 6.7 shows the effectiveness of using WTP(1) to choose H and
initialize the search. In this test, $H_{\min} = 1$ and $H_{\max} = 8$, and the mean value of the adaptive H is
3.7, which corresponds approximately to H = 4 in Figure 6.3 but without having to identify that
value beforehand.
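
The horizon selection in step 2 can be sketched as follows. This Python fragment reflects one reading of the rule, namely that H is the index of the last predicted change in any target's visibility along the Phase I plan, floored at the minimum horizon. The function name and the input layout are hypothetical, and the Phase I rollout and Phase II search themselves are not shown.

```python
def adaptive_horizon(observability, h_min=1):
    """Pick the planning horizon from a predicted visibility sequence.

    observability[k] is a tuple of booleans, one per target, giving the
    predicted visibility after k steps of the Phase I plan (k = 0 is now).
    The list therefore has H_max + 1 entries.
    """
    last_change = 0
    for k in range(1, len(observability)):
        if observability[k] != observability[k - 1]:
            last_change = k  # some target changed visibility at step k
    return max(last_change, h_min)


# Hypothetical prediction with H_max = 8: the second target becomes visible
# after three steps and nothing changes afterwards.
predicted = [(True, False)] * 3 + [(True, True)] * 6
horizon = adaptive_horizon(predicted, h_min=1)
```

When the prediction contains no visibility change at all, the function falls back to `h_min`, which matches the observation that H = 1 suffices while every target stays in view.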

In practice, however, the horizon length is always bounded above in order to limit the computation
in any planning iteration, and the upper bound $H_{\max}$ may sometimes be too small to achieve
the desired performance. Figure 6.8 illustrates such a scenario. There is only one occlusion, but it
is far enough from the upper target that once the UAV moves sufficiently far from the occlusion,
the horizon is too short to realize the benefit of heading toward the lower target when minimizing
the trace objective. This is despite the fact that the search is initialized with the UAV headed
straight down according to WTP(1).

Figure 6.7: WTP(1) used for initialization and adaptive horizon

Figure 6.8: Effect of truncated horizon with no ECTG

The remedy, of course, is to use WTP as the ECTG in Phase II, i.e., to employ WTP(H) as
in (6.2). The effect of WTP(H) is depicted in Figure 6.9. In general, the inclusion of the ECTG
term  makes  lookahead  more  robust  to  poor  initialization  and  short  horizons. 

In  general,  we  would  not  expect  the  optimal  trajectory  to  be  symmetric  with  respect  to  the 
two  targets,  because  of  a  number  of  possible  factors,  including:  (1)  the  location  of  the  occlusions, 
and  (2)  the  dynamics  and  the  acceleration  constraints  on  the  UAV.  In  Figures  6.6  and  6.9,  we  see 
this asymmetry in that the UAV does not spend equal amounts of time near the two targets. In
Figure 6.9, the position of the occlusion is highly asymmetric in relation to the path of the two
targets; in this case, it is not surprising that the UAV trajectory is also asymmetric. In Figure 6.6,
the two occlusions are more symmetric, and we would expect a more symmetric trajectory in the
long run. However, in the short run, the UAV trajectory is not exactly symmetric because of the
timing and direction of the UAV as it crosses the occlusion. The particular timing and direction of
the UAV result in the need for an extra loop in some instances but not others.

Figure 6.9: Behavior of WTP(H) policy

7  Multiple  UAV  Case 

As it stands, the procedure developed for the single UAV case is ill-suited to the case of multiple
UAVs,  because  the  WTP  is  defined  with  only  a  single  sensor  in  mind.  An  extension  of  the  WTP  to 
multiple  sensors  is  developed  in  Section  7.1,  and  in  Section  7.2  this  extension  is  applied  to  a  new 
scenario  to  demonstrate  the  coordination  of  two  sensors. 

7.1  Extension  of  WTP 

A slight modification of the WTP defined in (6.1) can certainly be used as an ECTG in scenarios
with more than one sensor, e.g.,

$$ \hat{J}(b) = \gamma \min_j D(s^j, \xi^i) \operatorname{Tr} P^i, \tag{7.1} $$

where $s^j$ is the state of sensor $j$. However, this underutilizes the sensors, because only one sensor
can  affect  the  ECTG.  One  would  like  the  ECTG  to  guide  two  sensors  toward  two  separate  occluded 
targets  if  it  makes  sense  to  do  so.  On  the  other  hand,  if  one  sensor  can  “cover”  two  occluded  targets 
efficiently,  there  is  no  need  to  modify  the  motion  of  a  second  sensor.  The  problem,  therefore,  is  to 
decide  which  sensor  will  receive  responsibility  for  each  occluded  target. 

It is natural to assign the “nearest” sensor to an occluded target, i.e., the one that minimizes
the MDO as in (7.1). However, to account for the effect of previous assignments to that sensor,
the MDO should not be measured along a straight line directly from the starting position of the
sensor, but rather, along the path the sensor takes while making observations on previously assigned
targets. In the spirit of the WTP for a single sensor, it is assumed that if multiple occluded targets
are assigned to a sensor, the most uncertain track (the one with the highest covariance trace) is
the  one  that  appears  in  the  WTP  and  governs  the  motion  of  the  sensor,  until  the  target  is  actually 
observed;  then,  the  next  most  uncertain  track  appears  in  the  WTP,  and  so  on.  So,  roughly  speaking, 
the  sensor  makes  observations  of  occluded  targets  in  order  of  decreasing  uncertainty. 

Therefore, a multiple weighted trace penalty (MWTP) term is computed according to the following
procedure:

1. Find the set of targets occluded from all sensors, and sort in order of decreasing $\operatorname{Tr} P^i$.

2. Set $\hat{J} = 0$, and $D_j = 0$ for each sensor $j$.

3. For each occluded target $i$ (in order):

(a) Find $j = \arg\min_{j'} [ D_{j'} + D(s^{j'}, \xi^i) ]$.

(b) If $D_j = 0$ then set $\hat{J} \leftarrow \hat{J} + \gamma D(s^j, \xi^i) \operatorname{Tr} P^i$.

(c) Set $D_j \leftarrow D_j + D(s^j, \xi^i)$ and $s^j \leftarrow p^{\mathrm{MDO}}(s^j, \xi^i)$.

This procedure is an approximation in several respects. First, it ignores the motion of the targets
in the interval of time it takes the sensor to move from one $p^{\mathrm{MDO}}$ location to the next. Second,
it ignores the dynamic constraints of the UAVs. Third, the total distance is computed by a greedy,
suboptimal algorithm. None of these deficiencies is insurmountable, but for the purpose of a quick
heuristic ECTG for ranking action plans, this MWTP is sufficient.
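
The MWTP procedure above translates almost line for line into code. The Python sketch below takes the MDO computation as a caller-supplied function, since the geometric routine is not specified here. The stand-in `toy_mdo`, which treats each occluded target as having a single known vantage point, is purely an assumption that makes the example runnable; all other names are hypothetical as well.

```python
import math


def mwtp(sensors, occluded, gamma, mdo):
    """Greedy multiple weighted trace penalty.

    sensors:  list of sensor positions.
    occluded: list of (trace, target) pairs for targets occluded from all sensors.
    mdo:      function (sensor_position, target) -> (distance, observation_point).
    """
    positions = list(sensors)           # working copies of the sensor positions
    travelled = [0.0] * len(positions)  # accumulated distance per sensor
    penalty = 0.0
    # Steps 1 and 3: visit occluded targets in order of decreasing trace.
    for trace, target in sorted(occluded, key=lambda t: t[0], reverse=True):
        options = [mdo(s, target) for s in positions]
        # (a) sensor with the shortest accumulated path to an observation point
        j = min(range(len(positions)), key=lambda m: travelled[m] + options[m][0])
        dist, point = options[j]
        # (b) only the first target assigned to a sensor adds to the penalty
        if travelled[j] == 0.0:
            penalty += gamma * dist * trace
        # (c) advance the sensor to the observation point
        travelled[j] += dist
        positions[j] = point
    return penalty


def toy_mdo(sensor, vantage):
    """Stand-in MDO: straight-line distance to a fixed vantage point."""
    return math.hypot(sensor[0] - vantage[0], sensor[1] - vantage[1]), vantage


penalty = mwtp(
    sensors=[(0.0, 0.0), (10.0, 0.0)],
    occluded=[(3.0, (2.0, 0.0)), (5.0, (1.0, 0.0)), (1.0, (9.0, 0.0))],
    gamma=2.0,
    mdo=toy_mdo,
)
```

In this example the first sensor is assigned the two most uncertain targets but is penalized only for the first, while the second sensor picks up the remaining target, so both sensors contribute to the terminal cost.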

7.2  Coordinated sensor motion

Figures 7.1 through 7.3 show snapshots of a scenario illustrating the coordination capability of the guidance
algorithm using the MWTP from the previous section as an ECTG term. There are three targets
(red,  blue,  and  black)  and  two  sensor  UAVs  (black  and  green).  This  scenario  also  demonstrates 
the  adaptive  horizon,  with  thin  magenta  and  orange  lines  showing  the  UAVs'  planned  Phase  I  and 
Phase  II  trajectories,  respectively,  according  to  the  current  horizon  length  H . 

Initially, the three targets are divided into two regions by an occlusion, and one sensor covers
each region. At this point H = 1 is a sufficient horizon. Then the black target heads down and
crosses two occlusions to enter the bottom region. In response, the green UAV chases after the
downward-bound target, while the black UAV moves to cover both upper regions; the sensors
coordinate to maximize coverage of the targets. Figure 7.2 plots the UAV motion plans at the
moment the planner decides to chase the downward-bound target. A large black X marks the spot
from which the green sensor expects to first see the black target. Generally speaking, the longer
the planning horizon, the earlier the UAVs react to the downward-bound target, and the less time
any target remains unseen by a sensor. In the moment depicted in the figure, Phase I has predicted
that the black target is going to cross the occlusion, and thus the adaptive horizon has increased
to H = 6.

Unlike  the  previous  scenarios,  this  scenario  features  random  target  motion  as  well  as  random 
measurement  noise.  This  allows  a  broader  comparison  of  performance  among  different  planning 
algorithms.  Figure  7.4  shows  a  plot  of  the  empirical  cumulative  distribution  function  (CDF)  of 
the average tracking performance of seven algorithms: H = 1 with no ECTG term, MWTP(1),
MWTP(3), MWTP(4), MWTP(5), MWTP(6), and MWTP(H) with adaptive H between 1 and
6. The plot shows that use of the approximate ECTG produces substantially better performance.
Without the MWTP term in the objective, one of the targets (usually the downward-bound one)
is ignored when it becomes occluded. There appears to be a minor benefit to using H = 1 or
adaptive horizon over the other settings. However, one should not make too much of this apparent
ranking. Perturbations of the problem configuration or other parameters result in other performance
rankings,  though  in  all  cases  MWTP  significantly  outperforms  the  pure  myopic  policy  lacking 
ECTG. 

Figure 7.1: Beginning of scenario: sensors cover separate regions

Figure 7.2: Transition: sensors coordinate plans to cover all targets as one target moves

Figure 7.3: End of scenario: sensors have coordinated for maximum coverage

Figure 7.4: CDF of tracking performance in multi-sensor scenario (plot title: two-sensor, three-target scenario; 3000 Monte Carlo runs)


8  Track  Ambiguity 

Track  accuracy  metrics  such  as  the  mean-squared-error  metric  proposed  in  Section  4.2  are  not 
the  only  measure  of  tracking  performance.  Other  considerations  such  as  track  duration  and  track 
continuity are also important. In particular, when target ID or threat class information is attached
to a track through some separate discrimination process, it is important to maintain a consistent
association between the track and the target it represents. So-called “track swaps” (switches in
the mapping between targets and tracks) may be caused by incorrect data association (updating
a track with measurements from a different target) or by approximation of the true Bayesian
update of the target state distribution that the track state represents. The latter cause is mainly
a  function  of  the  tracking  algorithm;  the  multiple  hypothesis  tracking  (MHT)  algorithm  with  an 
unlimited  hypothesis  set  represents  the  true  Bayesian  update  under  standard  assumptions  [37],  but 
any  practical  tracker  is  an  approximation  of  the  ideal.  Data  association  ambiguity,  on  the  other 
hand,  is  a  function  of  the  sensor  locations  as  well  as  the  tracker,  and  therefore  minimizing  this 
quantity  is  a  suitable  objective  in  the  UAV  guidance  problem.  In  this  section  we  demonstrate  the 
flexibility  of  the  POMDP  framework  by  augmenting  the  mean-squared-error  cost  function  with  a 
term  that  represents  the  risk  of  a  track  swap,  and  applying  the  same  basic  algorithm  to  demonstrate 
how  the  guidance  algorithm  reduces  the  probability  of  a  track  swap  in  a  scenario  where  the  targets 
are confusable.

8.1  Detecting  Ambiguity 

A  challenge  of  this  exercise  is  that  it  is  hard  to  predict  track  swaps  with  NBO,  since  the  full  spectrum 
of  uncertainty  is  not  explored.  In  the  context  of  predicting  the  performance  of  a  proposed  action 
sequence,  one  could  try  to  detect  a  track  swap  by  comparing  associations  of  predicted  track  states 
and predicted target states. This approach might work within a Monte Carlo approximation method
such as hindsight optimization, foresight optimization, or policy rollout. With NBO, however, the
only  predicted  target  state  is  the  one  that  comes  from  the  maximum  likelihood  value  of  the  predicted 
track  state,  so  the  best  data  association  will  always  be  the  “correct”  one.  We  must  resort  to  a  more 
indirect  approach,  measuring  a  quantity  that  serves  as  a  predictor  of  a  likely  track  swap. 

The  assessment  of  data  association  ambiguity  is  currently  a  topic  of  concern  in  tracking  [10], 
because  of  its  role  as  an  indicator  of  the  potential  for  error  in  track  states  and  track  identity. 
Nevertheless, the ambiguity of a single measurement-to-track data association is not a reliable
predictor of track swap. Consider the case of two targets that cross each other at an oblique
angle, which are tracked with an NCV model updated with position measurements. Despite the
complete ambiguity of association at the point when the targets cross, a track swap is extremely
unlikely under a reasonable track update rate because the velocity estimate is unaffected by the
ambiguity. Furthermore, tracks can become confusable after accumulating a series of updates with
slightly ambiguous data associations, none of which is egregious enough by itself to indicate trouble.
This suggests using an extended period of data association ambiguity as a predictor of track swap;
however, one can easily envision a scenario in which one or two misassociations are enough to cause
a track swap.

Similarity  of  target  state  distributions  (belief  states)  should  be  a  better  indicator  of  the  potential 
for  a  track  swap.  If  two  tracks  have  similar  distributions,  it  is  unlikely  that  the  targets  they  represent 
can  be  reliably  discriminated  from  each  other,  now  or  in  the  future.  For  this  approach  to  work, 
the  belief-state  updates  must  reflect  the  inherent  ambiguity  of  the  target  states.  It  will  not  suffice 
to  use  a  single-hypothesis  tracker  in  the  prediction  of  belief  states,  even  if  the  data  association  is 
correct as it is in the “truth tracker” used elsewhere in this report. Again, a full MHT algorithm
is  required  to  represent  the  true  Bayesian  update  of  the  belief  states,  which  is  intractable  exactly 
when  data  association  is  ambiguous.  We  have  found  that  when  the  hypothesis  set  is  truncated  to 
a  reasonable  limit,  the  MHT  has  trouble  representing  uncertainty  over  extended  periods  of  time. 
Instead, we use the joint probabilistic data association (JPDA) algorithm [1] for belief-state (and
track-state)  updates  in  this  context,  because  it  is  designed  to  represent  track  state  uncertainty  but 
in  the  compressed  representation  of  one  Gaussian  distribution  per  track. 

The dissimilarity between distributions may be measured in several ways: Kullback-Leibler
divergence (or alpha divergence), Bhattacharyya distance, or discordance [31], all of which have
closed-form solutions for Gaussians. However, these measures are basically average-case measures
of  how  often  the  state  values  from  the  two  distributions  are  within  a  small  neighborhood  of  each 
other.  It  turns  out  that  a  worst-case  metric  is  a  better  predictor  of  the  potential  for  a  track  swap. 
The  reason  for  this  is  that  track  swaps  are  more  closely  associated  with  instantaneous  ambiguities 
in  the  track  associations.  Specifically,  even  if  on  average  the  state  variables  from  two  tracks  are  not 
often  close,  even  a  single  occurrence  of  an  ambiguous  measurement  can  cause  a  track  swap.  The 
worst-case  metric  we  use  is  defined  next. 

Given a Gaussian distribution $\mathcal{N}(\mu, P)$, define the “$\chi^2$ value” as

$$\chi^2_{\mu,P}(x) := (x - \mu)^T P^{-1} (x - \mu),$$

so-called because when $x \sim \mathcal{N}(\mu, P)$ the quantity has a $\chi^2$ distribution with $n$ degrees of freedom,
where $n$ is the number of components in $x$. This is the square of the Mahalanobis distance from
$\mu$ to $x$. We define a worst-case “$\chi^2$ distance” between two Gaussian distributions $\mathcal{N}(\mu_1, P_1)$ and
$\mathcal{N}(\mu_2, P_2)$ as

$$D_{\chi^2}(\mu_1, P_1; \mu_2, P_2) := \min \{\, d \mid \exists x : \chi^2_{\mu_1,P_1}(x) \le d \text{ and } \chi^2_{\mu_2,P_2}(x) \le d \,\}
= \min_x \max \{ \chi^2_{\mu_1,P_1}(x),\ \chi^2_{\mu_2,P_2}(x) \}.$$

Note that it makes sense to compare $\chi^2_{\mu_1,P_1}(x)$ and $\chi^2_{\mu_2,P_2}(x)$ since they have the same distribution
when $x$ is drawn randomly from $\mathcal{N}(\mu_1, P_1)$ and $\mathcal{N}(\mu_2, P_2)$, respectively. Geometrically, $D_{\chi^2}$ may be
interpreted as the smallest $d$ such that the ellipsoid level surfaces $\chi^2_{\mu_1,P_1}(x) = d$ and $\chi^2_{\mu_2,P_2}(x) = d$
just touch each other. Analytically, the problem may be seen as measuring the distance between $\mu_1$
and $\mu_2$ but using two different distance metrics. One way to use two different metrics is to consider
the set of points “equidistant” from the two means, i.e., the points having the same distance from
each mean using the applicable Mahalanobis distance from each mean. Then, the desired distance
is given by the equidistant point with the least distance. Strictly speaking, $D_{\chi^2}$ (or its square root)
is not a distance because it does not satisfy the triangle inequality, but it does satisfy symmetry
and positivity, with a value of zero only when the means agree.
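The min-max definition can be checked numerically on small cases. The Python sketch below is restricted to 2-D Gaussians and uses a simple bisection on a scalarization weight, which is one possible way to evaluate the min-max and is not the method used in this work. All function names (`inv2`, `matvec`, `chi2_value`, `worst_case_chi2`) are hypothetical.

```python
def inv2(m):
    """Invert a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])


def chi2_value(x, mu, p_inv):
    """Squared Mahalanobis distance (x - mu)^T P^{-1} (x - mu)."""
    r = (x[0] - mu[0], x[1] - mu[1])
    w = matvec(p_inv, r)
    return r[0] * w[0] + r[1] * w[1]


def worst_case_chi2(mu1, p1, mu2, p2, iters=60):
    """Approximate min over x of max(chi2_1(x), chi2_2(x)), 2-D case only."""
    a, b = inv2(p1), inv2(p2)
    a_mu1, b_mu2 = matvec(a, mu1), matvec(b, mu2)

    def minimizer(lam):
        # The x minimizing lam*chi2_1 + (1 - lam)*chi2_2 solves
        # (lam*A + (1 - lam)*B) x = lam*A*mu1 + (1 - lam)*B*mu2.
        m = tuple(tuple(lam * a[i][j] + (1 - lam) * b[i][j] for j in range(2))
                  for i in range(2))
        rhs = tuple(lam * a_mu1[i] + (1 - lam) * b_mu2[i] for i in range(2))
        return matvec(inv2(m), rhs)

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        x = minimizer(lam)
        if chi2_value(x, mu1, a) > chi2_value(x, mu2, b):
            lo = lam  # chi2_1 still dominates; put more weight on it
        else:
            hi = lam
    x = minimizer(0.5 * (lo + hi))
    return max(chi2_value(x, mu1, a), chi2_value(x, mu2, b))


identity = ((1.0, 0.0), (0.0, 1.0))
d_equal = worst_case_chi2((0.0, 0.0), identity, (4.0, 0.0), identity)
print(d_equal)
```

With identical unit covariances and means four units apart, the touching point is the midpoint, at Mahalanobis distance 2 from each mean, so the value is close to 4.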

The computation of $D_{\chi^2}$ is a quasiconvex problem, which can be solved with a bisection method
involving a generalized eigenvalue problem at each iteration, according to the S-procedure [6]. This
is a rather expensive procedure to execute as part of a single objective function evaluation. However,
empirical tests revealed that one of the upper bounds used in the bisection method tends to be
a constant multiple of the true value in both ambiguous and unambiguous situations, so we elected
to use that as a surrogate for $D_{\chi^2}$. The upper bound in question is obtained by restricting the
problem to the line segment between $\mu_1$ and $\mu_2$:

$$\tilde{D}_{\chi^2}(\mu_1, P_1; \mu_2, P_2) := \min_{\alpha \in [0,1]} \max \{ \chi^2_{\mu_1,P_1}(\mu_1 + \alpha(\mu_2 - \mu_1)),\ \chi^2_{\mu_2,P_2}(\mu_1 + \alpha(\mu_2 - \mu_1)) \}. \qquad (8.1)$$

If a point $y$ lies in two intervals along that line segment starting at opposite ends, and $y$ has the same
$\chi^2$ value $d$ to each mean, then surely the ellipsoidal sets given by $\chi^2_{\mu_1,P_1}(x) \le d$ and $\chi^2_{\mu_2,P_2}(x) \le d$
intersect because $y$ is contained in the intersection. Therefore, $d$ is an upper bound on the minimum
distance such that there is an intersection, i.e., $\tilde{D}_{\chi^2}(\mu_1, P_1; \mu_2, P_2) \ge D_{\chi^2}(\mu_1, P_1; \mu_2, P_2)$. The
upper bound is computed by simply solving a quadratic equation, which determines the $\alpha \in [0, 1]$
such that the two $\chi^2$ values in (8.1) are equal.
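The quadratic mentioned above can be written out explicitly. With $\delta = \mu_2 - \mu_1$, $a = \delta^T P_1^{-1} \delta$ and $b = \delta^T P_2^{-1} \delta$, the two $\chi^2$ values on the segment are $\alpha^2 a$ and $(1-\alpha)^2 b$. The following Python sketch, again limited to 2-D Gaussians and using hypothetical function names, solves for the crossing point. It is an illustration under these assumptions, not the code used for the reported results.

```python
import math


def quad_form_inv2(delta, p):
    """Return delta^T P^{-1} delta for a 2x2 matrix P = ((a, b), (c, d))."""
    (a, b), (c, d) = p
    det = a * d - b * c
    dx, dy = delta
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def segment_chi2_bound(mu1, p1, mu2, p2):
    """Return (alpha, bound) for the line-segment restriction in (8.1)."""
    delta = (mu2[0] - mu1[0], mu2[1] - mu1[1])
    a = quad_form_inv2(delta, p1)  # chi2 value of mu2 measured from track 1
    b = quad_form_inv2(delta, p2)  # chi2 value of mu1 measured from track 2
    if a == 0.0 and b == 0.0:
        return 0.0, 0.0  # identical means: the bound is zero
    # At mu1 + alpha*delta the two values are alpha^2 * a and (1-alpha)^2 * b.
    # Equating them gives (a - b)*alpha^2 + 2*b*alpha - b = 0, whose root in
    # [0, 1] simplifies to the expression below.
    alpha = math.sqrt(b) / (math.sqrt(a) + math.sqrt(b))
    return alpha, alpha * alpha * a


alpha, bound = segment_chi2_bound((0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)),
                                  (4.0, 0.0), ((4.0, 0.0), (0.0, 4.0)))
print(alpha, bound)
```

Because $\alpha^2 a$ increases and $(1-\alpha)^2 b$ decreases on $[0, 1]$, the maximum of the two is smallest where they cross, which is why solving for equality yields the minimum in (8.1). In the example, the track with the larger covariance pulls the crossing point toward the other mean ($\alpha = 1/3$).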

8.2  Benefits  of  Ambiguity  Objective 

Using  the  method  from  Section  6,  a  sensor  tracking  two  targets  will  try  to  stay  near  to  both  targets, 
indeed  between  them  if  possible,  thereby  minimizing  ambiguity  even  without  an  explicit  measure  of 
ambiguity in the objective function. Thus, demonstrating the effect of a planner that deliberately
seeks to minimize ambiguity requires a scenario in which at least one sensor is assigned to track at
least three targets on its own.

The scenario depicted in Figures 8.1 and 8.2 demonstrates a genuine trade-off that has to be
made  by  the  planner.  Two  of  the  targets  (red  and  blue)  are  traveling  very  close  to  each  other.  The 
third  (black)  target  is  far  away  from  the  other  two.  If  the  sensor  stays  near  the  two  bottom  targets 
then  it  has  a  good  chance  of  maintaining  a  clear  picture  of  which  is  which,  but  its  estimate  of  the 
top  target’s  state  remains  at  a  consistently  poor  quality.  If  the  sensor  “weaves”  between  the  top  and 
bottom  targets  then  it  can  maintain  a  more  balanced  level  of  estimated  error  amongst  the  targets, 
but  it  is  much  more  likely  to  confuse  the  identity  of  the  bottom  two.  Recall  from  Section  6.2  that 
weaving optimizes the mean squared tracking error objective when tracking targets that are distant
from each other. With the same mean-squared-error objective function (approximated by Tr P),
the same behavior occurs in this scenario (Figure 8.1) except more time is spent in the lower region
with the two closely spaced targets. By adding a term proportional to $1/D_{\chi^2}$ to the cost, a penalty
is  placed  on  ambiguity  of  track  states,  and  as  seen  in  Figure  8.2,  the  result  is  that  the  sensor  stays 
near  the  two  bottom  targets. 
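The shape of such an augmented cost can be sketched in a few lines of Python. Several details here are assumptions made only for illustration: the weight name `gamma`, the summation of the penalty over track pairs, and the small `floor` that guards against division by zero when two track means coincide.

```python
def augmented_cost(trace_terms, pair_distances, gamma, floor=1e-9):
    """Sum of covariance traces plus gamma / D for each track pair.

    trace_terms: one Tr P value per track (mean-squared-error part).
    pair_distances: one chi-square distance per pair of tracks.
    gamma: weight on the ambiguity penalty (a tuning parameter).
    floor: assumed guard so that D = 0 does not divide by zero.
    """
    ambiguity = sum(1.0 / max(d, floor) for d in pair_distances)
    return sum(trace_terms) + gamma * ambiguity


# Made-up numbers: three tracks, with one close pair (small distance).
cost_close = augmented_cost([10.0, 12.0, 30.0], [0.5], gamma=20.0)
cost_far = augmented_cost([10.0, 12.0, 30.0], [40.0], gamma=20.0)
print(cost_close, cost_far)
```

The penalty grows rapidly as a pair of tracks becomes confusable, so a planner minimizing this cost is steered toward sensor positions that keep the distance between the two closely spaced tracks large.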

Figure 8.3: CDF of data association error rate in ambiguity scenario (one sensor, three targets; 5000 Monte Carlo runs; horizontal axis: percent of incorrect track-to-measurement associations)

The outcomes shown in Figures 8.1 and 8.2 are in fact representative of the behavior of the
two different objective functions over multiple Monte Carlo runs. Figure 8.3 plots the cumulative
distribution of the fraction of incorrect data associations over the course of each of 5000 Monte
Carlo simulations. The weaving behavior produced by the trace objective clearly results in a
higher proportion of association errors. However, as mentioned in Section 8.1, individual track-to-
measurement associations are not our concern per se, but rather the correctness of the final track-to-
truth association after the targets separate. The following table summarizes how frequently tracks
are assigned to the correct target ID at the end of the scenario. Again, the benefit of including $\tilde{D}_{\chi^2}$
or $D_{\chi^2}$ for ambiguity avoidance is clear:


objective                             H    % correct ID
Tr P                                  1    70.14%
Tr P                                  6    58.58%
Tr P + $\gamma/\tilde{D}_{\chi^2}$    1    97.04%
Tr P + $\gamma/\tilde{D}_{\chi^2}$    6    97.02%
Tr P + $\gamma/D_{\chi^2}$            1    97.20%

(Note  that  in  the  presence  of  complete  ambiguity  between  the  two  targets  on  the  bottom,  we  could 
guess  the  correct  target  ID  by  a  coin-flip  and  expect  50%  accuracy.) 

While  the  above  results  demonstrate  the  success  of  our  ambiguity  objective  in  accomplishing 
what  it  was  explicitly  designed  for,  one  additional  positive  outcome  from  this  scenario  may  be 
surprising  at  first.  The  objective  functions  that  include  ambiguity  tend  to  produce  a  better  overall 
mean squared tracking error than the trace objective alone, as seen in Figure 8.4. The reason is
that the latter equality in (5.2) assumes correct data association. In other words, in the presence of
ambiguity, the trace of the position covariance no longer represents the mean squared track error
relative to truth. As such, we can view the term $1/D_{\chi^2}$ as a heuristic ECTG: the term plays a
role similar to that of the ECTG term in Section 7 and contributes to improvement in the overall tracking
performance.

The histogram in Figure 8.5 shows a clear bimodal distribution when using the trace objective,
apparently corresponding to the cases where a track swap does or does not occur, respectively.
Although  the  trace  objective  produces  good  tracking  performance  when  no  track  swap  occurs,  the 
significant  second  mode  leads  to  a  poor  average  performance  result.  In  contrast,  Figure  8.6  shows 
that  when  the  objective  includes  ambiguity,  the  mean  squared  error  is  heavily  distributed  around 
a  single  mode  with  a  very  low  weight  on  the  second  mode  (even  with  the  ambiguity  objective  the 
sensor  occasionally  produces  a  track  swap  because  constraints  on  the  UAV  motion  prohibit  it  from 
remaining in place between the two targets).

Figure 8.4: CDF of RMS track error in ambiguity scenario (one sensor, three targets; 5000 Monte Carlo runs)

Figure 8.5: Histogram of RMS track error, Tr P objective, H = 1 (horizontal axis: root mean squared track position error)

Figure 8.6: Histogram of RMS track error, Tr P + $\gamma/D_{\chi^2}$ objective, H = 1 (horizontal axis: root mean squared track position error)


9  Conclusion 

Our  main  contribution  in  this  research  is  a  demonstration  of  the  effectiveness  of  the  POMDP 
formalism  as  a  basis  for  designing  a  solution  to  a  complex  resource  management  problem.  The 
application  of  ideas  from  POMDP  theory  is  not  straightforward  because  approximations  must  be 
made  in  order  to  develop  a  practical  solution.  Nevertheless,  by  grounding  the  design  approach  in 
the  principles  of  POMDP,  we  can  preserve  the  key  advantages  of  the  theoretical  framework,  namely 
the  flexibility  to  handle  complex  models  and  objectives,  and  the  lookahead  nature  of  the  solution. 

We  have  illustrated  both  of  those  advantages  in  the  UAV  guidance  examples  presented  here. 
These  simplified  examples  were  designed  to  highlight  some  of  the  central  issues  involved  in  the 
practical application of POMDP-based design. They identified the benefit of a nonmyopic policy,
the  crucial  importance  of  an  approximate  ECTG  term  in  the  objective,  the  structured  roles  that 
heuristics  can  play  in  the  algorithm  (e.g.,  adaptive  horizon  length,  search  initialization),  and  the 
ability  to  change  the  objective  without  major  redesign. 

We  have  also  presented  a  new  approximation  method  called  nominal  belief-state  optimization 
(NBO)  which  is  particularly  well-suited  to  the  tracking  application  considered  here,  because  under 
standard  assumptions  the  expected  cost  can  be  computed  analytically.  As  NBO  is  a  special  case 
of  hindsight  optimization  and  foresight  optimization,  a  design  based  on  NBO  is  easily  extended  to 
these  more  computationally  expensive  methods  if  more  accurate  representation  of  the  randomness 
of  the  problem  is  required. 

As  our  main  goal  is  to  illustrate  some  of  the  practical  issues  involved  in  applying  the  POMDP- 
based  design  approach,  the  actual  guidance  system  developed  here  is  not  meant  to  be  taken  as  the 
best design we could achieve. There are many directions in which the algorithm could be improved,
including:

• a more accurate MDO approximation;

•  a  more  global  search  for  the  optimal  action  plan; 

• an adaptive weight on the ECTG term (which currently requires some tuning);

•  a  different  parameterization  of  the  action  space  that  allows  for  longer  planning  horizons  while 
limiting  the  growth  of  the  search  space; 


•  a  limited  use  of  Monte  Carlo  methods  to  explore  alternative  futures  other  than  the  nominal 
belief-state  sequence. 

The  conclusion  we  wish  to  emphasize  is  that  the  principled  framework  of  a  POMDP-based  design 
provides an understanding of where approximations are applied, leading to avenues of performance
improvement (such as the ones listed above) as more computational resources become available.